Providing Internet Access to Portuguese Corpora: the AC/DC Project
Abstract
In this paper we report on the activity of the project Computational Processing of Portuguese (Processamento computacional do português) concerning the provision of access to Portuguese corpora through the Internet. One of its activities, the AC/DC project (Acesso a corpora/Disponibilização de Corpora, roughly "Access and Availability of Corpora"), allows a user to query around 40 million words of Portuguese text. After describing the aims of the service, which is still subject to regular improvements, we focus on the process of tagging and parsing the underlying corpora, using a Constraint Grammar parser for Portuguese.
Similar resources
Corpora at Linguateca: Vision and roads taken
In the late nineties, access to Portuguese data in electronic form was scarce, and was considered one of the bottlenecks limiting the advance of natural language processing of Portuguese (Santos, 1999a), so Linguateca’s launch of AC/DC aimed to significantly increase the amount of data, and its quality, in that the data was annotated and classified. To the best of my knowledge, A...
Experiments in Human-computer Cooperation for the Semantic Annotation of Portuguese Corpora
In this paper, we present a system to aid human annotation of semantic information in the scope of the project AC/DC, called corte-e-costura. This system leverages the human annotation effort by providing the annotator with a simple system that applies rules incrementally. Our goal was twofold: first, to develop an easy-to-use system that required a minimum of learning from the part of the ...
Linguateca's infrastructure for Portuguese and how it allows the detailed study of language varieties
In this paper I briefly present Linguateca, an infrastructure project for Portuguese which is ten years old, and show how it provides several possibilities for studying grammatical and semantic differences between varieties of the language. After a short history of Portuguese corpus linguistics, presenting the main projects in the area, I discuss in some detail the AC/DC project (Santos & Bi...
Providing On-line Access to Portuguese Language Resources: Corpora and Lexicons
Several Language Resources (LRs) for Portuguese, developed at the Center of Linguistics of the Lisbon University (CLUL), are available on-line at CLUL’s webpage: www.clul.ul.pt/english/sectores/projecto_rld.html. These LRs have been extracted from or developed based on the Reference Corpus of Contemporary Portuguese (CRPC), a monitor corpus containing, at present, more than 300 million word...
Floresta Sintá(c)tica: A treebank for Portuguese (Susana Afonso, Eckhard Bick, Renato Haber, Diana Santos)
This paper reviews the first year of the creation of a publicly available treebank for Portuguese, Floresta Sintá(c)tica, a collaboration project between the VISL and the Computational Processing of Portuguese projects. After briefly describing the main goals and the organization of the project, the creation of the annotated objects is presented in detail: preparing the text to be anno...